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Abstract 

Motivated by learning problems including max-norm regularized matrix com- 
pletion and clustering, robust PCA and sparse inverse covariance selection, we 
propose a novel optimization algorithm for minimizing a convex objective which 
decomposes into three parts: a smooth part, a simple non-smooth Lipschitz part, 
and a simple non-smooth non-Lipschitz part. We use a time variant smoothing 
strategy that allows us to obtain a guarantee that does not depend on knowing in 
advance the total number of iterations nor a bound on the domain. 



1 Introduction 

We propose an optimization method (PRISMA — PRoximal Iterative SMoothing Al- 
gorithm) for problems of the form^] 

min{F(x) := f(x) + g{x) + h{x) (1.1) 

where: 

• / : R™ — > K is a convex and ^/-smooth function, that is, differentiable with a 
Lipschitz continuous gradient: ||V/(x) — V/(y)|| < Lf \\x — y\\ Va;,y 6 
R n . 

'All the theorems hold also in general Hilbert spaces, but for simplicity of exposition we consider a 
Euclidean setting. 
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• g : W 1 — > K is a convex p g -Lipschitz continuous function: \f(x) — f(y)\ < 

• /i : M. n — »• R U {+00} is a proper, lower semicontinuous, convex (but possi- 
bly non-continuous/non-finite) function. For example, h could be an indicator 
function for a convex constraint. 

We further assume that we can calculate gradients of /, and that g and h are "simple" 
in the sense that we can calculate their proximity operator^ 

( \\x — u\\ 2 1 

prox (x, a) = argmm < h g(u) : u g K > , 

u { 2a J 

where x € R", a e and similarly for h. That is, our method can be viewed 

as a black-box one, which accesses /, g and h only through the gradients of / and the 
proximity operators of g and h. Each iteration of PRISMA requires one evaluation of 
each of V/, prox 9 and prox^ and a few vector operations of overall complexity O(n), 
and after k such iterations we have 



F(x k+1 ) - F(x*) = 



Lj_ P g lOg k 

k 2 k 



Applications. Our main motivation for developing PRISMA was for solving opti- 
mization problems involving the matrix max-norm (aka 72 : £ 1 ->£ oc norm). The max- 
norm has recently been shown to possess some advantages over the more commonly 
used trace-norm Lee et al. |2010j, Jalali and Srebro [2012|, Foygel and Srebro | 2011| , 
but is lagging behind it in terms of development of good optimization approaches — a 
gap we aim to narrow in this paper. Both norms are SDP representable, but standard 
SDP solvers are not applicable beyond very small scale problems. Instead, several first- 
order optimization approaches are available and commonly used for the trace-norm, 
including singular value thresholding Cai et al. |2008 | and other Jaggi and Sulovsky 



l2^T0l , |MTetlIl | |2^09l , [Tomioka et al.||2010| . However, until now there have been no 
practical first-order optimization techniques available for the max-norm. In Sec. [4]we 
show how max-norm regularized problems can be written in the form \\.\\ and solved 
using PRISMA, thus providing for the first time an efficient and practical first order 
method for max-norm optimization. 

We also demonstrate how PRISMA can be applied also to other optimization prob- 
lems with complex structure, including robust principal component analysis (robust 
PCA), sparse inverse covariance selection and basis pursuit. In particular, for robust 
PCA, we show how PRISMA yields better performance then previously published ap- 
proaches, and for basis pursuit we obtain the best known convergence rate using only 
first-order and proximity oracles. 

Our Approach. Following ideas of Nesterov Nesterov [ 2005b | and others, we 
propose to smooth the function g and use its proximity operator to obtain gradients 
of its smoothed version. However, unlike Nesterov 1 2005b |, where a fixed amount of 

2 For more clarity here we explicitly define the proximity operator as a function of a too because our 
algorithm heavily depends on this parameter changing over time. 
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smoothing is used throughout the run, we show how to gradually change the amount of 
smoothing at each iteration. We can then use an accelerated gradient descent approach 
on / plus the smoothed version of g, as in Nesterov 1 2005b |. We also use the ideas of 
partial linearization to handle the component h: instead of attempting to linearize it as 



in gradient descent, we include it as is in each iteration, as in FOBOS Duchi and Singer 



p009| and ISTA/FISTA |Beck and Teboulle| f2009|. The gradual smoothing allows us 
to obtain a guarantee that does not depend on knowing in advance the total number of 
iterations, needed in Nesterov [2005b |, nor a bound on the domain, needed in Nesterov] 
]2005b) , IChambolle and Pock| | |20 1 1 1 , and paying only an additional log(fc) factor. 

Notation. We use /* to denote the Fenchel conjugate, Sq the indicator of the set 
Q (zero inside Q and infinite outside), S" , the set ofnxn positive definite matrices, 
and S™ the set of p.s.d. ones. 



2 Smoothing of Nonsmooth Functions 



As was suggested by Nesterov [2005a b| and others, in order to handle the non-smooth 



component g, we approximate it using a smooth function. Such a smoothing plays 
a central role in our method, and we devote this section to carefully presenting it. In 
particular, we use the (3-Moreau envelope Bauschke and Combettes [201 1 1, known also 



as Moreau-Yosida regularization. For a function tp : — > K U {+oo}, the Moreau 
envelope ipp, where f3 > 0, is defined as 



ipp(x) := inf | " ^ " + tp(u) : u € E"| Vx G R" . 

The function cpp is a smooth approximation to tp (in fact, the best possible approxima- 
tion of bounded smoothness), as summarized in the following Lemmas: 



Lemma 2.1 (Proposition 12.29 in |Bauschke~a nd Com bettespOl 1| ). Let tp : R n -t R 

be a proper, lower semicontinuous, convex function and /3 > 0. Then tpp is ^-smooth 
and its gradient can be obtained from the proximity operator of (p as: \7(ipp)(x) = 
i(x-prox v (x,/3)). 

Lemma 2.2. Let ip : K n convex, p-Lipschitz function, then 

1. if (3 > 0, then <pp < tp < <pp + |/3p 2 ; 

2. ifP>P'> 0, then <pp < <pp, < <p. p + - (3')p 2 . 



Proof. The proof extends the one of Lemma 2.5 in Shalev-Shwartz et al. [2010 1 to vec- 
torial functions. Define ^ x {u) — -^\\x — u\\ 2 +ip(u). We have <pp{x) = inf u fy x (u) < 
^x(x) = <p(x), and this proves the left hand side of the first property. For the other 
side of the inequality, we have 

**(u) = yj\\x - u\\ 2 + ip(u) - ip(x) + tp(x) 
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where we have used the Lipschitz property of tp. Hence 

f\\x - u\\ 2 

<P[>(x) > V{x) + inf I — p\x - u 

= tp{x) - \p P 2 . 

The second property follows from the first one and (<Pp>)p_p, = Pp (Proposition 12.22 
in Bausc hke and Co mbettes [ 201 1| ). ■ 

These two lemmas show how to obtain a smooth approximation of a Lipschitz con- 
tinuous function, with the tradeoff between the smoothness and the degree of approxi- 
mation controlled by the Lipschitz constant of the non-smooth function. Furthermore, 
in order to be able to work with the smoothed approximation p.p, and in particular 
calculate its gradients, all that is required is access to the proximity operator of the 
non-smooth function tp. This is the reason we require access to prox g . 

Relation to Nesterov Smoothing. The Moreau envelope described above also 
underlies Nesterov's smoothing method Nesterov 1 2005b |, although his derivation is 
somewhat different. Nesterov refers to a non-smooth function ip that can be written as: 

p(x) = max{(j;. u) — tp(u) : u € Q} (2.1) 

where x € W 1 , Q C K m is a bounded closed convex set, and proposes to smoothen the 
function ip o A, where A : K™ — > W n is a linear operator, with tpp o A, where: 

tpp(x) :— max{(x,u) — <p{u) — j3d{u) : u € Q}, (2.2) 

with x £ K™, (3 > 0, and d(u) > f||it — iio|| 2 - The smoothness and approximation 
properties of (pp are then given in terms of the diameter of the set Q, the parameter a 
and a certain norm of the matrix A. Any convex function <p can indeed be written in the 
form p.l) , where tp is its convex conjugate and Q is the set of all subgradients of tp — 
the diameter of Q thus corresponds to the Lipschitz constant (bound on the magnitude 
of the subgradients) of tp. Focusing on A being the identity and d(u) = ^||u|| 2 (i.e. 
a = 1), we have that pp defined in ( |2.2| > is exactly the Moreau envelope tpp: 

Proposition 2.3. If tp is a proper, lower semicontinuous convex function, tp is its 
Fenchel conjugate, Q the set of all its subgradients and d(u) — |||m|| 2 , then tpp = pp. 

This is obtained as a corollary from a slightly more general Lemma: 

Lemma 2.4. If tp is a proper, lower semicontinuous convex function which can be 
written as ( |2.1| > and d(u) = \ ||u|| 2 , then tpp — (tpO &q) p' where fOg is the infimal 
convolution of f and g. 

Proof. From the definition of tp and [Bausc hke and C ombettes, |201 1| Thm. 13.32], 
we derive that tp = tp* — Sq. Hence 



<pp(x) = max\{x,u) - tp*(u) - §|M| 2 } 

u£Q [ Z J 



<P* + ^\\-\\ 2 + 6q) (x) = (tpn5* Q )p(x), 
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Algorithm 1 PRISMA 



Parameters {f3 k > : k £ N} 



Initialize x\ = yi € dom/i, 0! = 1,L]_ = Lr J 1 



for k = 1, 2, . . . do 



4L 

1 + , 1 



a; fe+ i <- prox,, ((1 - 2^-)tffc - r^V/(y fc ) + ^ prox ff (y fc , f3 k ) , i 

yfe+i <- Xk+i + ®k+i - i) {xk+i - %k) 
end for 



where the last equality follows from |Bauschke and Combettes 2011 Thm. 13.32, 
Prop. 13.21 and Prop. 14.1]. ■ 

Nesterov's formulation is somewhat more general, allowing for different regulariz- 
es d(u) and different sets Q, but our presentation is a crisper and more direct statement 
of the assumptions on the function tp to be smoothed and its relationship to <pp: we just 
need to be able to calculate the proximity operator of tp, to obtain gradients of tpp, and 
we need to rely on its Lipschitz constant in order to be able to control the quality of the 
approximation. 

3 PRoximal Iterative SMoothing Algorithm 

Our proposed method, PRISMA, is given as Algorithm [T] As is standard in "accel- 
erated" first order methods (e.g. |Beck and Teboulle] ]2009[ ), we keep track of two 
sequences of iterates, x k and y k . At each iteration we perform a proximal gradient 

update x k+1 <- prox h (y k - j^V(/ + gp k )(y k ) , jH, where the gradient of g Pk is 
calculated according to Lemma [2~T| and f3 k is some sequence of parameters. We then set 
y k +i to be a linear combination of x k+ i and x k . Note also that the recursive formula 
for the sequence 9 k satisfies 

J. L = ^±lJ_. (3 .l) 

We establish the following guarantee on the values of the objective function. 

Theorem 3.1. If the sequence j3 k is nonincreasing, then the iterates of Algorithm^ 

satisfy, \fx* £ dom h, 

F{x k+ i) - 

In the following we show two possible ways to set the sequences j3 k . One possi- 
bility would be to decide on a fixed number of iterations T and to set f3 k to a constant 
value. 
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Corollary 3.2. For a given T > set /3 k = ^rjiy Vfc Then for every x* e dom ft 



F(s r+1 ) - F(a;*) < 2 Mg^M! + 2 ^IK - *i|| 



(T + l) 



T + l 



This choice is optimal if we know the number of iterations T we will use, as well 
as the Lipschitz constant and a bound on the distance from a minimizer, sharing the 



same limitation with Nesterov [2005b |. But with this fixed choice of smoothing, even 
if we perform additional iterations, we will not converge to the optimum of our original 
objective (only of the smoothed, thus approximate, objective). 

We propose a dynamic setting of (3 k , that gives a bound that holds uniformly for 
any number of iterations, and thus a method which converges to the optimum of the 
original objective. 



Corollary 3.3. Let a > and set f3 k 

k €E N and for every x* € dom h 



in Algorithm [7j then, uniformly for any 



F(x k+1 ) - F(x*) <2 



Lf + ak 
(k + 1) 2 




3 , Lr+ak 1 
— log / + — — — 
la Lf + a Lf + a 



Proof. It is easy to derive that < 9 k < ^xy (by induction) and the elementary 
chain of inequalities V fc , — h-^ = -4r + o —rr < ^rr + fi rt dx. The 

" L^d%— 1 a+bi q+o a+m — a+b Jl a+ox 

assertion then follows from Theorem 1 3. II and these facts. ■ 



Note that the optimal choice of a does depend on the Lipschitz constant p g and on 
the distance of our initial point from a minimizer, but that the algorithm does converge 



for any choice of a. Thus, compared to the bound in Corollary 3.2 the price we pay 
for not knowing in advance how many iterations will be needed is only an additive 
logarithmic facto^j 



When / = the updates reduce to x k +i <- prox^ (prox 9 (y fc , i), ^), and 



ik 



jr. This can be viewed as an adaptive version of an accelerated backward-backward 
splitting algorithm. 

Before proving Theorem |3.1| we need two additional technical lemmas. 

Lemma 3.4 (descent lemma, Thm. 18.15 in Bauschke and Combettes [201 1|, Thm. 
2.1.5 in Nesterov [2004]). The convex function f is L-smooth if and only if 



f(x)<f(y) + (x-y,Vf(y)) + ^\\x-y\\ 2 Vx,y e 



^Ouyang and Gray 1 2012a] use a similar proof technique, for the simpler case when h = 0, and they 
report a convergence rate bound without the logarithmic term. Unfortunately their proof contains an error; 
they fixed it in the arxiv version of their paper Ouyang and Gray 1 2012b]. Their new theorem contains the 
term maxfe \\x* — x k || 2 , which is not known how it can be bounded. 
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Lemma 3.5 (Thm. 2.1.2 in Nesterov [2004)). For a /i-strongly conv 'ex function g, and 
v = argming, 



g(w)-g(v) > 



Proof, (of Theorem 3.1 1 Denote F' 3 * (x) :— f(x) + gp k (x) + h(x). Fix an arbitrary 
k 6 N. Since / + gp h is L^-smooth by Lemma 2.1 applying Lemma [3~4| yields 



F f) *{x k+1 )<f{y k )+g Pk {y k )+ I f\ 



(V{f + 9p h ){Vk),Xk-\ 



%k+i - Vk\\ 2 
- Vk) + h(x k +i) 



L k I 



We now use the strong convexity of the function x H> (V(/ + g/3 k )(y k ),x) 
J/fe|| 2 + h(x), whose minimizer equals x k +i (by Lemma 2.1 1. Using Lemma 3.5 
w = (1 — 9k)xk + 6 k x, we obtain 

F 0k (s/b+i) < f(y k ) + gp k (y k ) + h((l- 6 k )x k + 6 k x) 

+ (V(/ + gp h ){Vk), (1 ~ &k)x k + k x - y k ) 



with 



Li 



- e k )x k + e k x - y k f 



||(1 - 6 k )x k + 6 k x - x k+1 \\' 



We let z k := j^y k + [1 — g^J x k for every k 6 N and note that z k+ i = x k 
-g^ixk+i — Xk) by the algorithmic construction of y k +\. Therefore, 

F h (x k+1 ) < f(y k ) + g 0k {y k ) + h((l- 9 k )x k + 9 k x) 

+ (V(/ + gp k )(yk), (1 - k )x k + k x- Vk ) 

+ ^\\*-Zkf- 6 ^\\x-z k+1 t 

< fivk) + gp k (yk) + (i - o k )h(x k ) + e k h(x) 
+ (V(/ + gp k ){y k ), (i - k )x k + 6 k x - y k ) 

x-z k \\ =— \\x - z k+1 \\ 



2 11 " 2 
< {l-6 k )F h {x k )+0 k F^{x) 

I "fc^*ll 1 1 2 OkLkn ||2 

+ -^— 2~ II 2*>+i|| , 

where we have used the convexity of the functions /, g, h. Using Property 1 in Lemma 
I2.2l we have 

F h {x k+1 ) - F(x) < ^* (||ar - z k f - \\x - z k+1 \\ 2 ) 



(l-6 k )(F^(x k )-F(x)) 
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We now use Property 2 in Lemma 2.2 to change the smoothing parameter. Denoting 

by D k = F Pk (x k ) - F(x), we obtain 



D k+1 1 , 2 (p k -p k+1 )p> i-e k 

^< 2 (||X-Z fe || -11,-^1)+ wlLk 4 -A 



Using the definition of 8 k ( |3 - 1 1 >, and summing we obtain 



D 



k+l 



eiL k 



(A-A+i)p : 



Finally, we apply Property 1 in Lemma 2.2 to obtain F(x k +i) < F^ k+1 (x k +i) + 
^Pk+iPg an d reordering the terms we obtain that 



1 



1 



1 



D k+1 < -elL k \\x - Xl \\ 2 + -(3 k+lP 2 g + -B\L k 



(ft - ft+i)p 



Applying the definition of 9 k ( |3.1| l and gathering the ft terms, we obtain the stated 
bound. ■ 



Relation to Prior Work 



Teboulle 



2009 



With g = 0, PR ISMA with fixed smoothing reduces to a variant of FISTA | Beck and 
, with essentially the same guarantee of 0(t£ 



Conversely, with h = 0, or h being an indicator for a convex domain, PRISMA 
becomes similar to Nesterov's smoothed accelerated method |Nesterov, 2005b|, ob- 
taining the rate of ), but with some important differences. First, we provide 
an explicit bound in terms of the Lipschitz constant of g, as discussed in Sec. [2] Sec- 
ond, Nesterov's method relies on a fixed domain and involves two projections onto the 
domain at each iteration, whereas PRISMA uses only a single projection. Moreover, 
Nesterov's methods and guarantees rely on the domain being bounded and require that 
the number of iterations must be fixed in advance, in order to set its parameters. On the 
other hand, the dynamic tuning of f3 k in PRISMA allows us to obtain a guarantee that 
depends only on ||a;* — x%\\, and with parameters (i.e. (3 k and 9 k ) which do not have 
to depend on it, or on a fixed number of iterations. Unbounded domains are frequently 
encountered in practice, for example, in our motivating application of max-norm opti- 



mization. The excessive gap primal-dual algorithm in Nesterov 1 2005a | improves over 



the one in Nesterov 1 2005b | because it does not need to fix in advance the number 
of iterations, but it shares the same shortcoming on the assumption of the bounded 
domain. 

The primal-dual algorithm in Chambolle and Pock 1 201 1| can minimize functions 
of the form g(Kx)+h(x), where K is a linear operator, and it assumes to have access to 
piox g and prox^. The original problem is transformed into a saddle point problem and 
they obtain a O(r) convergence rate for the primal-dual gap, for non-smooth objective 
functions. However this kind of guarantee will translate to a guarantee on the functional 
objective value only assuming that g and h* are Lipschitz (see also the discussion on 
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this point in Loris and Verhoeven | 2011) ). Moreover it is not clear how the obtained 
bound explicitly depends on the Lipschitz constants and the other relevant quantities. 

A different approach is given by the Alternating Direction Method of Multipli- 
ers (ADMM) algorithms |Boyd et al. 201 1| , that optimize the augmented Lagrangian 
associated to the optimization problem, updating alternatively the variables of the prob- 
lem. ADMM approaches have the drawback that constraints in the optimization prob- 
lem are not necessarily satisfied at every iteration. This makes them ill suited to prob- 
lems like max-norm matrix completion, where two matrices would be generated at each 
iteration, only one of the two being PSD, and for none of the two a convergence rate 
bound is known. Recently, Goldfarb et al. p012[ , [Scheinberg et al. [ 2010[ presented 
an ADMM method, called Alternating Linearization Method (ALM), which uses alter- 
nate projections of an augmented Lagrangian. It is applicable when g = 0, that is, for 
objectives which can be decomposed into a smooth plus a "simple" function, and it has 
the same rate of convergence of FISTA. It can also be used with non-smooth functions, 
using the smoothing method proposed by Nesterov [ 2005b) , and hence it shares the 
same shortcomings. 

Another recent related method | Loris and Verhoeven 201 1| is applicable when 
h = and / is quadratic. This method differs significantly from both PRISMA and 
Nesterov's aforementioned methods, in part because no 0k and no affine update of yk 
are required. The applicability of Loris and Verhoeven | |201 1[ is not as general as 
PRISMA (for example it does not apply to max norm regularization or basis pursuit) 
and it is not clear whether it can be extended to any smooth function /. Similar to 
PRISMA, it attains an0(|) rate without requiring boundedness of the domain. 

PRISMA combines the benefits of many of the above methods: we can handle 
functions decomposable to three parts, as in the max-norm optimization problem, with 
an unbounded domain, and without having the parameters to depend on the distance to 
a minimizer nor to fix a priori the number of iterations, at the cost of only an additional 
log-factor. This is an important advantage in practice: setting parameters is tricky and 
we typically do not know in advance how many iterations will be required. 



4 Applications 



4.1 Max Norm Regularized Matrix completion 

The max norm (also known as the j2-.ei-ye 00 norm) of a matrix X is given by ||X|| max = 
vc\fx=uv T ||^||2,oo||^||2,oo) where ||Z7||2,oo is the maximal row norm of U. The max 
norm has been suggested as an effective regularizer for matrix completion (i.e. collab- 
orative filtering) problems, with some nice theoretical and empirical advantages over 



the more commonly used nuclear norm (or trace norm) Srebro et al. [2005 1, Foygel 
and Srebro | 2011| , Lee et al. 1 20 1 0| . It has also been recently suggested to use it in 
clustering problems, where it was shown to have benefits over a nuclear-norm based 
approach, as well as over spectral clustering Jalali and Srebro | 2012) . However, the 
max norm remains much less commonly used than the nuclear norm. One reason 
might be the lack of good optimization approaches for the max norm. It has also been 
recently brought our attention that Jaggi | 2011) suggested an optimization approach 
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for mini|x|| maa .<i /(-X") which is similar in some ways to these first order methods, 
and which requires O iterations, but where each iteration consists of solving a 

max-cut type problem instead of an SVD — the practicality of such an approach is not 
yet clear and as far as we know it has not been implemented. Instead, the only practical 
approach we are aware of is a non-convex proximal point (NCPP) approach, without 
any performance guarantees, and relying on significant "tweaking" of learning rates 
Lee et al.] | |2.010| . Here we present the first practical first order optimization method for 



max norm regularized problems. 

To demonstrate the method, we focus on the matrix completion setting, where we 
are given a subset £1 of observed entries of a matrix M G M mx ™ and would like to infer 
the unobserved entries. Using max norm regularization, we have to solve the following 
optimization problem: 

min \\\W\\ max + \\Pn(W) - Pn(M)\\ 2 F . (4.1) 
w 

where Pq is a projection onto the entries in (zeroing out all other entries). Expressing 
the max norm through its semi-definite representation we can rewrite ( |4.1| i as 

min 
A,B.x 



{\maxdiag(£v *) + \\P n {X) - P n (M)\\ 



Setting f(X) = \\Pq(X) ~ P Q (M)\\ 2 F , g(X) = Xmeocdiag(X) and h(X) = 
8 &m +n (X), we can apply PRISMA. The proximity operator of h corresponds to the 

projection onto the set of (m + n) x (m + n) positive semidefinite matrices, that can 
be computed by an eigenvalue decomposition and setting to zero all the negative eigen- 
values. The proximity operator of g amounts to thresholding the diagonal elements of 
the matrix. 

To demonstrate the applicability of PRISMA, we conducted runtime experiments 
on the MovieLens 100K Datasej^] selecting an equal and increasing number of users 
and movies with the most ratings. We used training data equal to 25% of the total 
number of entries, selected uniformly at random among the ranked entries. We im- 
plemented PRISMA in MATLAB. We stopped PRISMA when the relative difference 
k \\~x k \\ F k was l ess man 10 • The computation bottleneck of the algorithm is the 
computation of the eigenvalue decomposition at each step. So, to speed-up the algo- 
rithm, we implemented the strategy to calculate only the top c eigenvalues, where c is 
defined as min(m + n, (# eigenvalues > at previous iteration) + 1). If the smallest 
calculated eigenvalue is positive, we increase c by 5 and recalculate them, otherwise we 
proceed with the projection step. This strategy is based on the empirical observation 
that the algorithm produces solutions of decreasing rank, and practically this proce- 
dure gives a big speed-up to the algorithm. This kind of strategy is common in the 
matrix completion literature and RPCA, see e.g. |Goldfarb et al.| | [2009| . We compared 
our algorithm to SDPT3 Tiitiincii et al. 1 2003) , a state-of-the package for semi definite- 



quadratic-linear programming. For simplicity the parameter A was set the same in all 



^ http : //www . grouplens -org/node/73 
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SDPT3 


PRISMA 


Matrix Size 


Cost 


Time 


Cost 


Iter. 


Time 


100x100 


1078.76 


13s 


1079.00 


7366 


124s 


150x150 


2617.65 


89s 


2618.11 


7866 


218s 


200x200 


4795.76 


364s 


4796.97 


8908 


281s 


250x250 


7736.76 


1323s 


7738.59 


9524 


404s 


300x300 






11406.39 


9843 


578s 



Table 1 : Performance evaluation of PRISMA and SDPT3 on subsets of the MovieLens 
100K dataset. 



the experiments, equal to 0.2 jOJ that guarantees reasonable good generalization per- 
formance for all the sizes considered. The parameter a of PRISMA was set to a rough 

estimate of the optimal value, that is, to (m+^p^M)^- ■ ^he winning times and the 
final value of the objective function are listed in Table [T] SDPT3, a highly optimized 
SDP solver relying on interior point methods, indeed has very good performance for 
relatively small problems. But very quickly, the poor scaling of interior point methods 
kick in, and the runtime increases dramatically. More importantly, SDPT3 ran out of 
memory and could not run on problems of size 300x300 or larger, as did other SDP 
solvers we tried. On the other hand, the number of iterations required by PRISMA to 
reach the required precision increases only mildly with the size of the matrices and the 
increase in time is mainly due to the more expensive SVDs. We could run PRISMA 
on much larger problems, including the full MovieLens 100K data set. In Figure[T]we 
report the objective value as function of time (in seconds) for PRISMA on this data 
set. The time of each PRISMA iteration decreased over time, as the rank of the iterate 
decreases, ranging from 30 seconds of the first iterations to 2 seconds of the last ones, 
yielding a total runtime of several hours on the entire dataset. In contrast, it is incon- 
ceivable to use generic SDP methods for such data. On the other extreme, non-convex 
proximal-point (NCPP) methods are indeed much faster on this (and even much larger) 
data sets, taking only about ten seconds to reach a reasonable solution. However, even 
running NCPP for many hours with different step size settings and regardless of the 
allowed dimensionality, NCPP was not able to reach the global minimum, and was al- 
ways at least 4% worst than the solution found by PRISMA — the minimum objective 
value attained by NCPP is also plotted in Figure [T] PRISMA thus fills a gap between 
the very pricey SDP methods that are practical only on tiny data sets, and the large- 
scale NCPP methods which require significant parameter tweaking and produce only 
rough (even if sometimes satisfying) solutions. 



4.2 Robust PCA 



In the method of pmdes et al.| Q201 1| | for robust PCA, a data matrix M € M n i x ™2 
given and the goal is to solve 



min{||W|| tr + A||S||i : W + S — M, W, S e 



This problem is of the form (TT) with the choice / = 0,g(W) = X\\M-W\\i,h(W) = 
\\W\\t r - We used the RPCA method to solve the task of background extraction from 
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Movielens 100k - Entire Matrix 




Figure 1: Objective value for PRISMA as a function of iterations in log scales. 



surveillance video. By stacking the columns of each frame into a long vector, we get 
a matrix Al whose columns correspond to the sequence of frames of the video. This 
matrix M can be decomposed into the sum of two matrices M = W + S. The matrix 
W, which represents the background in the frames, should be of low rank due to the 
correlation between frames. The matrix S, which represents the moving objects in the 
foreground in the frames, should be sparse since these objects usually occupy a small 
portion of each frame. 



We have compared our MATLAB implementation of PRISMA to ALM Gold- 



farb et al. [ 2009 , 2012 1, where this problem is solved with an augmented Lagrangian 
method, smoothing both the trace and the l\ norm. We further used the ALM continu- 
ation strategy described in Goldfarb et al. 1 2009 1, that is usually beneficial in practice 
even though no theoretical iteration count guarantees are known for it. The parame- 
ter a of PRISMA, analogously to max-norm experiments, is set to 
compared it to the static strategy, setting /3& to constant ti mes a. 
In our experiments, we used two videos^] introduced in 



X^/n 1 n 2 

\\M\\i 



We also 



Li et al. 



1 2004) , "Hall" and 



"Campus" with the same parameters used in Goldfarb et al. | 2009| . Since for all al- 
gorithms the complexity for each update is roughly the same, and dominated by an 
SVD computation, we directly compare the number of iterations in order to reduce the 
hardware/implementation dependence of the experiments. The average CPU time per 
iteration for the algorithms and datasets are reported in Table [2] The results are shown 
in Figure [2] On both image sequences PRISMA outperforms the static algorithm, uni- 
formly over the number of iterations. Notice also that for some constant settings the 



algorithm does not converge to the optimum, as expected from Theorem 3.1 Moreover 
it also outperforms the state-of-the-art algorithm ALM. 



http : //perception .i2r.a-star.edu. sg/bk_model/bk_index . html 



12 





Campus 


Hall 


PRISMA (static & dynamic) 


3.5 ±0.1 


1.9 ± 0.01 


ALM 


5 ± 0.3 


2.66 ±0.1 



Table 2: Time (in seconds) per step, average ± standard deviation. 



4.3 Sparse Inverse Covariance Selection 



In sparse inverse covariance selection (SICS) - see Wainwright et al. [2007], Yuan and 
|Lin| ]2007[ and references therein - the goal is to solve 

min -logdet(Jf) + (E,X) + A||X||i, (4.2) 

where Eg S™ is a given positive semidefinite matrix. This problem is of the form 
(fTT) with the choice / = (£, -),g = X \\ ■ ||i, h = -logdet(-) + 5 S » (•). There- 
quired proximity operator of h at X 6 S™ , can be easily computed by computing a 
singular value decomposition of X and solving a simple quadratic equation to obtain 
each singular value of the result. We have compared PRISMA with two state-of-the-art 
approaches: ALM [Scheinberg et al. |20 10], which achieves stats-of-the-art empiri- 
cal results, and ADMM Boyd et al. 1 20 1 1 j^J a recently popular general optimization 
method that performs very well in practice, but lacks iteration complexity guarantees. 

ALM uses a complex continuation strategy that requires tuning of four different 
parameters - we used the data-set specific parameter settings provided by [Scheinberg 
et al.| | |2010") . For ADMM we used the default parameters. For both ALM and ADMM 



there is no clear theory on how to set their parameters. To set the a parameter of 
PRISMA we need a rough estimate of ||X*||f. The optimal solution of ( |4.2| i will be 
very close to a diagonal matrix, so we approximate its optimum of adding the constraint 
X = al, and it is easy to see that the optimal a is jt^. Using the fact that p g = An, 
we set a = A(l + X)y/n in all experiments. That is, both for ADMM and PRISMA we 
did not specifically tune any parameters. 



We experimented on same five gene expression network data sets from Li and Toh 



1 2010], which were also used by Scheinberg et al. | 2010| , and used the same setting 



A = 0.5. 

The results on the datasets are shown in Fig. |3]7] Again we show the objective 
function in relation to the number of iterations, being the complexity per step of the 



algorithms roughly the same. The average times per iteration are reported in Table 3]4 
We can see that the performance of PRISMA is pretty close to the one of ALM, while 
ADMM results in a slower convergence. Notice that the difference between ALM 
and ADMM is probably mainly due to the continuation strategy used in ALM, whose 
parameters are been tuned on these datasets. On the other hand PRISMA just has one 
parameter, whose optimal setting is given by the theory. 



6 We used the MATLAB code available at |http : / /www . Stanford ■ edu/ ~boyd/ papers /| 
admm/ 
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Hereditary breast cancer 


Lymph 


PRISMA 


11.74 ±0.75 


0.33 ±0.02 


ALM 


7.5 ±0.09 


0.27 ± 0.05 


ADMM 


13 ±0.23 


0.47 ± 0.06 



Table 3: Time (in seconds) per step, average ± standard deviation. 





Estrogen receptor 


Arabidopsis thaliana 


Leukemia 


PRISMA 


0.72 ±0.04 


1.15 ±0.09 


3.61 ± 0.19 


ALM 


0.45 ±0.03 


0.75 ±0.04 


2.46 ±0.05 


ADMM 


0.80 ±0.02 


1.40 ± 0.05 


4.15 ±0.09 



Table 4: Time (in seconds) per step, average ± standard deviation. 



5 Discussion and Future work 

We have proposed PRISMA, a new algorithm which can be used to solve many convex 
nonsmooth optimization problems in machine learning, signal processing and other ar- 
eas. PRISMA belongs to the broad family of proximal splitting methods and extends 
in different ways forward-backward splitting, accelerated proximal, smoothing and ex- 
cessive gap methods. PRISMA is distinguished from these methods by its much wider 
applicability, which allows for solving problems such as basis pursuit or semidefinite 
programs such as max norm matrix completion. 

We have validated our method with experiments on matrix completion, robust PCA 
problems, and sparse inverse covariance selection, showing that a simple to code im- 
plementation of PRISMA can handle large scale problems and outperforms or be equal 
in efficiency to current state of the art optimization algorithms. But PRISMA can be 
useful also for other learning-related problems. 

For example, Basis Pursuit (BP) |Chen et al. 1 2001 1, Guigue et al. |2005 | can also 



be solved with PRISMA. The BP optimization problem is formulated as 

min {||x||i : Ax — b, x E R d } , 

where A £ R mxd , b £ M. m are prescribed input/output data. In most applications, the 
sample size is much smaller than the dimensionality, rn <C d. This problem is of the 
form \\A) with the choice / = 0, g = || • ||i, h = £4, where A = {x g R d : Ax = b}. 
BP can be rephrased as a linear program, which is then solved with standard approaches 
(simplex, interior point methods, etc.), or it can be solved with ADMM, but in both 
cases no complexity results are known. It is easy to see that, for every x G R d , its 
projection on A can be written as Proj^(a;) = x — A T z, where z is any solution of 
AA T z = Ax — b, hence we can use PRISMA to solve this problem. The complexity 
of the projection step is 0(dm), since m <§; d. It would be interesting to see whether 
PRISMA would indeed yield empirical advantages here, as well as in varied other 
learning and optimization problems. 
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jure 2: Comparison of the objective values as a function of the number of iterations, 
log scales. 
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ure 3: Comparison of the objective values as a function of the iterations. 
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ure 4: Comparison of the objective values as a function of the iterations. 
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ure 5: Comparison of the objective values as a function of the iterations. 
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ure 6: Comparison of the objective values as a function of the iterations. 
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SICS - Leukemia 
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Figure 7: Comparison of the objective values as a function of the iterations. 
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